Some Properties of Optimal Cartesian Product Files for Orthogonal Range Queries

Authors

  • Annie Y. H. Chou
  • Wei-Pang Yang
  • Chin-Chen Chang
Abstract

The Cartesian product file (CPF) has been proposed as a good multi-attribute file structure. Although designing an optimal CPF for partial match queries (PMQs) has been proven to be NP-hard, some useful properties have been established for PMQs to guide the design. However, a good CPF for PMQs may not be beneficial for orthogonal range queries (ORQs). Therefore, in this paper, we study properties that help the design of a good CPF for ORQs. We found that the problem of designing the optimal CPF for ORQs is related to the problem of finding a minimal-f N-tuple. We will also show some theories of minimal-f N-tuples and develop a method for generating a minimal-f N-tuple. Finally, we will present some properties of the optimal CPF for ORQs from the theories of minimal-f N-tuples.

INFORMATION SCIENCES 90, 91-107 (1996). © Elsevier Science Inc. 1996, 655 Avenue of the Americas, New York, NY 10010. 0020-0255/96/$15.00. SSDI 0020-0255(95)00284-7

1. INTRODUCTION

The need for efficient retrieval methods in large databases has increased recently. Many applications rely on files that store large amounts of data. In such applications, the data is usually too large to be stored wholly in primary memory. In practice, files are divided into buckets and then stored on disk. Therefore, the distribution of a file over multiple buckets directly influences retrieval performance.

A multi-attribute file (MAF) is a collection of records characterized by more than one attribute. An orthogonal range query (ORQ) is a query of the following form: retrieve all records satisfying $(x_1, x_2, \ldots, x_N)$, where $x_i$ is either a range $[l_i, u_i]$ contained in the domain of the $i$th attribute, or is unspecified (denoted by *).
For instance, in a two-attribute file system {a, b} × {1, 2, 3, 4}, (*, [1,3]) denotes an ORQ which retrieves all records with the first attribute being either "a" or "b" and the second attribute being in the range [1,3]. If $l_i = u_i$ for all specified $x_i$, then the ORQ reduces to a partial match query (PMQ).

Since an entire bucket is brought into primary memory each time the disk is accessed, we can measure the retrieval performance of a file structure simply by counting the average number of buckets (ANB) to be examined over all possible queries. Therefore, the optimal MAF system design problem can be stated as follows: given a set of multi-attribute records and a fixed number of buckets, arrange the records into buckets such that the ANB to be examined over all possible queries is as small as possible [3].

In prior research, great progress has been achieved on designing optimal MAF systems for PMQs. Tang, Buehrer, and Lee [14] showed that this problem is NP-hard, and many heuristic methods have been proposed to solve it, for example, the string homomorphism hashing (SHH) method by Rivest [12], the multi-key hashing (MKH) method by Rothnie and Lozano [13], the multi-dimensional directory (MDD) method by Liou and Yao [11], the optimal Cartesian product file (CPF) design method by Chang, Du, and Lee [4], and the Greedy File design method proposed by Chou, Yang, and Chang [9]. Recently, the optimal disk allocation for PMQs was also proposed by Abdel-Ghaffar and El Abbadi [1]. However, the corresponding problem for ORQs has received far less study [6-8].

In our work, we pay particular attention to the design of Cartesian product files (CPFs) for ORQs. The CPF is in essence a hashed file, and the choice of hash functions (i.e., partition forms) determines the retrieval performance. Chou et al. derived the performance formula of CPFs for ORQs in [8]. However, as we shall see, the performance formula is too complicated to apply directly.
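As a small illustration of the ORQ definition and the {a, b} × {1, 2, 3, 4} example from the introduction, the following Python sketch evaluates the query (*, [1,3]) against every possible record. The function names `matches` and `is_partial_match`, and the encoding of an unspecified component as `None`, are choices made for this sketch and do not come from the paper.

```python
from itertools import product

# Domains of the two-attribute example from the text.
DOMAINS = [["a", "b"], [1, 2, 3, 4]]


def matches(record, query):
    """Return True if the record satisfies the orthogonal range query.

    Each query component is either None (unspecified, written * in the
    text) or an inclusive (low, high) pair taken from that attribute's domain.
    """
    for value, component in zip(record, query):
        if component is None:
            continue  # an unspecified attribute accepts every value
        low, high = component
        if not (low <= value <= high):
            return False
    return True


def is_partial_match(query):
    """An ORQ reduces to a PMQ when every specified range has low == high."""
    return all(c is None or c[0] == c[1] for c in query)


records = list(product(*DOMAINS))   # all 2 * 4 possible records
query = (None, (1, 3))              # the ORQ (*, [1,3]) from the text
hits = [r for r in records if matches(r, query)]
```

Here `hits` contains the records whose second attribute is 1, 2, or 3, for either value of the first attribute, while a query such as `(None, (2, 2))` is classified as a PMQ by `is_partial_match`.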
In this paper, we intend to find properties that determine which partition form is better, instead of directly calculating the performance of each. Chang et al. [3] presented some properties of CPFs for PMQs and showed that the problem of designing the optimal CPF for PMQs is related to the problem of finding a minimal N-tuple, where the N-tuple corresponds exactly to the partition size form of a CPF. In this paper, we study whether the problem of designing the optimal CPF for ORQs is also related to the problem of finding a minimal N-tuple. Fortunately, we are able to show that the answer is yes, although the N-tuple now corresponds to the partition form of a CPF. However, there are exceptions: the minimal N-tuple is not always the optimal partition form of a CPF for ORQs. Therefore, we define a new term, minimal-f N-tuple, to cover the minimal N-tuples that describe the properties of optimal CPFs for ORQs. We will also show some theories of minimal-f N-tuples and develop a method for generating a minimal-f N-tuple. Finally, we will present some properties of the optimal CPF from the theories of minimal-f N-tuples.

In the next section, we review the CPF concept and its performance formula for ORQs. In Section 3, we define a minimal-f N-tuple, derive some theories of minimal-f N-tuples, and propose an algorithm for finding a minimal-f N-tuple. In Section 4, the problem of designing the optimal CPF for ORQs is shown to be related to the problem of finding a minimal N-tuple with reverse order (defined later) or finding a minimal-f N-tuple. Finally, conclusions are given in Section 5.

2. A REVIEW OF CPFS FOR ORQS

The CPF concept was originally proposed by Lin, Lee, and Du [10]. They also pointed out that file systems designed using the SHH, MKH, and MDD methods are all CPFs. CPFs are defined as follows.

DEFINITION 2.1 [10].
An N-attribute CPF is a file in which each domain $D_i$ is divided into $m_i$ equal-sized subdomains $D_{i1}, D_{i2}, \ldots, D_{im_i}$, and all records in each bucket are of the form $D_{1s_1} \times D_{2s_2} \times \cdots \times D_{Ns_N}$, where $1 \le s_i \le m_i$ and $1 \le i \le N$. The partition form of this CPF is denoted as $(m_1, m_2, \ldots, m_N)$, where $m_i$ is the number of partitions in the $i$th domain. The partition size form of this CPF is denoted as $(z_1, z_2, \ldots, z_N)$, where $z_i = |D_i| / m_i$ is the size of each subdomain of the $i$th domain and is an integer.

To measure the performance of CPFs for ORQs, we derived in [8] the following formula, which directly evaluates the ANB of a CPF over all possible ORQs. Let $F$ be an N-attribute CPF with partition form $(m_1, m_2, \ldots, m_N)$ and let $d_i$ be the domain size of the $i$th attribute. If all queries are equally likely, then the ANB of this CPF over all possible ORQs is

$$\mathrm{ANB(ORQ)} = \frac{NB}{6^N \cdot NOQ} \prod_{i=1}^{N} \left( d_i^2 + 6 + \frac{3d_i^2 + 3d_i}{m_i} - \left(\frac{d_i}{m_i}\right)^2 \right),$$

where $NB$ is the number of buckets, $NB = \prod_{i=1}^{N} m_i$, and $NOQ$ is the total number of ORQs, $NOQ = \left(\tfrac{1}{2}\right)^N \prod_{i=1}^{N} (d_i^2 + d_i + 2)$.

EXAMPLE 2.1. Consider a two-attribute file with $D_1 = \{a, b\}$ and $D_2 = \{1, 2, 3, 4, 5, 6\}$. Suppose we have six buckets, and each can hold two records. Let $D_{11} = \{a, b\}$, $D_{21} = \{1\}$, $D_{22} = \{2\}$, $D_{23} = \{3\}$, $\ldots$, $D_{26} = \{6\}$. Then the following file $F$ is a CPF.
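The ANB of a CPF over all ORQs can be checked numerically on small cases. The sketch below enumerates every ORQ for a given partition form, counts the buckets each query touches, and compares the average with a closed-form expression. Several things here are assumptions of this sketch rather than statements from the paper: each domain is encoded as the integers 1 to $d_i$, every $m_i$ divides $d_i$, all queries (including the all-unspecified one) are equally likely, and the closed form coded in `anb_closed_form` is one reading of the formula from [8] that should be verified against that reference. All function names are hypothetical.

```python
from fractions import Fraction
from itertools import product


def components(d):
    """All choices for one query component over the domain {1..d}:
    None for '*', plus every inclusive range (l, u) with l <= u."""
    return [None] + [(l, u) for l in range(1, d + 1) for u in range(l, d + 1)]


def subdomains_touched(component, d, m):
    """Number of the m equal-sized subdomains of {1..d} a component overlaps."""
    if component is None:
        return m
    z = d // m  # subdomain size; assumes m divides d
    l, u = component
    return (u - 1) // z - (l - 1) // z + 1


def anb_brute_force(d, m):
    """Average number of buckets examined, by enumerating every ORQ."""
    total = 0
    count = 0
    for query in product(*(components(di) for di in d)):
        buckets = 1
        for comp, di, mi in zip(query, d, m):
            buckets *= subdomains_touched(comp, di, mi)
        total += buckets
        count += 1
    return Fraction(total, count)


def anb_closed_form(d, m):
    """Closed-form ANB as read in this sketch (an assumption, see lead-in)."""
    n = len(d)
    nb = 1                        # NB: number of buckets
    noq = Fraction(1, 2 ** n)     # NOQ: total number of ORQs
    term = Fraction(1)
    for di, mi in zip(d, m):
        nb *= mi
        noq *= di * di + di + 2
        term *= (di * di + 6
                 + Fraction(3 * di * di + 3 * di, mi)
                 - Fraction(di, mi) ** 2)
    return nb * term / (6 ** n * noq)


# Domain sizes of Example 2.1 with two six-bucket partition forms.
d = (2, 6)
anb_1_6 = anb_brute_force(d, (1, 6))
anb_2_3 = anb_brute_force(d, (2, 3))
```

Under these assumptions the enumeration and the closed form agree on both partition forms, which is the kind of comparison the paper aims to replace with structural properties of the partition form.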

Similar resources

An Optimal Disk Allocation Strategy for Partial Match Queries on Non-Uniform Cartesian Product Files

The disk allocation problem addresses the issue of how to distribute a file onto several disks to maximize the concurrent disk accesses in response to a partial match query. In the past this problem has been studied for binary as well as for p-ary Cartesian product files. In this paper, we propose a disk allocation strategy for non-uniform Cartesian product files by a coding theoretic approach...

Scalability Analysis of Declustering Methods for Cartesian Product Files

Efficient storage and retrieval of multi-attribute datasets has become one of the essential requirements for many data-intensive applications. The Cartesian product file has been known as an effective multi-attribute file structure for partial-match and best-match queries. Several heuristic methods have been developed to decluster Cartesian product files over multiple disks to obtain high perfo...

Scalability Analysis of Declustering Methods for Multidimensional Range Queries

Efficient storage and retrieval of multiattribute data sets has become one of the essential requirements for many data-intensive applications. The Cartesian product file has been known as an effective multiattribute file structure for partial-match and best-match queries. Several heuristic methods have been developed to decluster Cartesian product files across multiple disks to obtain high perf...

The Merrifield-Simmons indices and Hosoya indices of some classes of cartesian graph product

The Merrifield-Simmons index of a graph is defined as the total number of the independent sets of the graph and the Hosoya index of a graph is defined as the total number of the matchings of the graph. In this paper, we give formulas for the Merrifield-Simmons and Hosoya indices of some classes of Cartesian products of two graphs $K_2 \times H$, where $H$ is a path graph $P_n$, cyclic graph $C_n$, or star gra...

Optimal Linear Hashing Files for Orthogonal Range Retrieval

In this paper, we are concerned with the problem of designing optimal linear hashing files for orthogonal range retrieval. Through the study of performance expressions, we show that optimal basic linear hashing files and optimal recursive linear hashing files for orthogonal range retrieval can be produced, in certain cases, by a greedy method called the MMI (minimum marginal increase) method; a...

Journal:
  • Inf. Sci.

Volume 90

Pages 91-107

Publication date: 1996